SSBB With BA Final Examination (Section - II)-Resubmission

Author

Mallesham Yamulla

Question-1:

On February 24th 2024, Govt. India Published a report on the Household Consumption Expenditure Survey 2022-2023 Fact Sheet. The link is given for later/future reference.

WebLink

The survey collects the consumption of Goods and Services by households on a sample basis to understand the consumption expenditure pattern, standard of living and wellbeing of households. This data is used for a number of purposes of Govt. planning. One of the measures which is estimated from the surveyed data is MPCE (Monthly Per-capita Consumption Expenditure) or how much an average Indian Urban and Rural Household spend in a month. The attached excel file (National Family Household Survey Data - Unsupervised Learning.xlsx) contains the relevant data for analysis. There are three sheets in the excel file containing

  • State wise MPCE data along with number of households (Rural and Urban) in the state (sheet-1&2)

  • Item Group wise Percentage of MPCE for Rural and India. (Sheet-3)

A. Perform basic descriptive analytics and visualization on the data and give your inference. Your analytics and inference need to be very focused so that Govt. of India can take the inputs for the decision making in planning and any interventions.

B. Using a suitable unsupervised learning algorithm identify whether there is any similarity among the states with respect to Average MPCE separately for urban and rural and both rural and urban combined. How many segments/groups the states can be made for further analysis. Identify the group of states with close to each other average MPCE.

C. Perform similar analysis as above for item group using both rural and urban parentage MPCE and find the most optimal item segments.

E. Finally suggest three action points which Govt. of India may need to initiate to address the variation observed in average MPCE among states and focusing on the item group consumption expenditure to reduce the load on the households.

F. Any special/innovative analytics you may wish to run on the dataset to find hidden, interesting, and useful information for Govt. of India.

1.Data Importing and Preparations.

In preparation for data analysis, the MPCE data has been loaded into three distinct tabular dataframes. I have reviewed the initial ten rows of each dataframe to become familiar with the data structure.

Average MPCE of each state in the year 2022-2023 (Rural India)

state no_of_house_holds avg_mpce_rural
0 Andhra Pradesh 95813 4996
1 Arunachal Pradesh 1953 5300
2 Assam 63174 3546
3 Bihar 198464 3454
4 Chhattisgarh 47120 2575
5 Delhi 510 6595
6 Goa 1569 7388
7 Gujarat 73816 3820
8 Haryana 33394 4912
9 Himachal Pradesh 17260 5573

Average MPCE of each state in the year 2022-2023 (Urban India)

state no_of_house_holds avg_mpce_urban
0 Andhra Pradesh 45443 6877
1 Arunachal Pradesh 602 8649
2 Assam 9279 6210
3 Bihar 18955 4819
4 Chhattisgarh 12298 4557
5 Delhi 30965 8250
6 Goa 2270 8761
7 Gujarat 58955 6630
8 Haryana 22330 7948
9 Himachal Pradesh 2601 8083

Percentage Cotribution of MPCE to Each Item Group

item_group rural_india_in_202223 urban_india_in_202223
0 cereal 6.90 4.49
1 cereal substitutes 0.02 0.02
2 gram 0.24 0.18
3 pulses and pulse products* 1.73 1.20
4 sugar & salt 0.92 0.60
5 milk and milk products 8.14 7.15
6 vegetables 5.26 3.76
7 fruits (fresh) 2.48 2.48
8 fruits (dry) 1.15 1.29
9 egg, fish & meat 4.80 3.54

2. Exploratory Data Analysis.

2.1 How are the house holds data distributed in Rural Area?.

count mean std min 25% 50% 75% max
no_of_house_holds 36.0 53644.44 69193.94 26.0 1484.0 25765.0 87894.5 307870.0

Inferences:

These points are drawn based out of sample collected across 36 states in india.

  • The average number of households is 53,644, but there is significant variation across states as indicated by the standard deviation.
  • Half (50%) of the households have roughly 25,000.
  • Three-quarters (75%) of the households have 87,000 or fewer.
  • A small number of states have households exceeding 100,000.

2.2 How are the average MPCE data distributed collected in Rural Area?.

count mean std min 25% 50% 75% max
avg_mpce_rural 36.0 4833.22 1391.89 2575.0 3751.5 4649.5 5486.0 7787.0

Inferences:

These points are drawn based out of sample collected across 36 states in india.

  • The average Monthly Per Capita Expenditure (MPCE) in India is 4,833. However, there is significant variation in spending across households, as indicated by the standard deviation.

  • Half Spend Less, Half Spend More: The median MPCE is 4,649, which is slightly lower than the average. This means that half of the households spend more than ₹4,649 per month on consumption, and the other half spend less.

  • The MPCE follows a near-normal distribution, indicating a bell-shaped curve where most households’ spending falls around the average.

2.3 Grouping the no of house hold data as in Rural Area and See how are their avg MPCE per each group?

The previous analysis highlighted inconsistencies in the number of households surveyed across different states. Since states with larger populations likely have more households, a consistent sample size per state might not have been achievable.

Grouping States by Household Count:

To address this, I have categorized the states into four groups based on the number of households in each state. The specific thresholds used for this grouping are as follows:

  • A:1-1499
  • B:1500-25999
  • C:26000-87999
  • D:>=88000

  • Interestingly Each group has the same number of data points (9) for balanced representation.
house_hold_group avg_mpce_rural
min max mean median
0 A:1-1499 4062 7787 6146.78 6595.0
1 B:1500-25999 3530 7388 4999.67 4721.0
2 C:26000-87999 2575 5960 4103.00 3820.0
3 D:>=88000 3158 5457 4083.44 4076.0

Inferences

  • Interestingly, Group A (states with the fewest households, ranging from 1 to 1,499) has the highest average MPCE at 6,146.
  • In contrast, Groups C and D exhibit a similar average MPCE of around 4,100.
  • Group B (states with 1,500 to 25,999 households) falls in between, with an average MPCE of 5,000.

2.4 How are the house holds data distributed in Urban Area?.

count mean std min 25% 50% 75% max
no_of_house_holds 36.0 24861.89 31406.13 79.0 1672.5 10788.5 41642.25 119824.0

Inferences

These points are drawn based out of sample collected across 36 states in india.

  • The average number of households is 24,861 but there is significant variation across states as indicated by the standard deviation.
  • Half (50%) of the households have roughly 10788.
  • Three-quarters (75%) of the households have 42,000 members or fewer.
  • A small number of states have households exceeding 100,000 members.

2.5 How are the average MPCE data distributed collected in Urban Area?.

count mean std min 25% 50% 75% max
avg_mpce_urban 36.0 7062.14 1853.39 4557.0 5855.25 6780.0 7822.75 12577.0

Inferences:

These points are drawn based out of sample collected across 36 states in india.

  • The average Monthly Per Capita Expenditure (MPCE) in Urban is 7062. However, there is significant variation in spending across households, as indicated by the standard deviation.

  • Half Spend Less, Half Spend More: The median MPCE is 6780, which is slightly lower than the average. This means that half of the households spend more than 6,780 per month on consumption, and the other half spend less.

  • Thre are few assignable causes found in MPCE of Urban, The MPCE seems to not following near-normal distribution and the data exhibits a right skewed distribution.

2.6 Grouping the no of house hold data as in Urban Area and See how are their avg MPCE per each group?

  • In contrast to the rural household groups with equal data points, the urban groups exhibit variation. For instance, Group A (1-1499 households) has 9 data points, while Group B has a higher number of data points at 15. This implies there are more urban states with a household count between 1,500 and 25,999 compared to other urban household categories.
house_hold_group avg_mpce_urban
min max mean median
0 A:1-1499 5511 12125 7817.67 7159.0
1 B:1500-25999 4557 12577 6870.07 6577.0
2 C:26000-87999 5011 8251 6640.20 6753.5
3 D:>=88000 6683 7742 7212.50 7212.5

2.7 How is the data:the percentage contribution of MPCE is distributed in Rural ?

count mean std min 25% 50% 75% max
rural_india_in_202223 27.0 3.7 2.69 0.02 1.1 3.52 5.22 9.41

Inferences:

  • In Rural areas, the average MPCE across different item groups is roughly similar, regardless of whether we consider the mean or median values. This suggests a symmetrical distribution of spending patterns. On average, individuals in urban areas spend approximately 3.7 per item group.

2.8 How is the data:the percentage contribution of MPCE is distributed in Urban ?

count mean std min 25% 50% 75% max
urban_india_in_202223 27.0 3.7 2.79 0.02 1.43 3.54 5.8 10.53

Inferences

  • In urban areas too, the average MPCE across different item groups is roughly similar, regardless of whether we consider the mean or median values. This suggests a symmetrical distribution of spending patterns. On average, individuals in urban areas spend approximately 3.7 per item group.

2.9 How is the the percentage contribution of MPCE per Eachh Item Group in Rural. ?

Inferences:

  • Higher Spending: The data suggests that Indians, particularly in urban areas allocate a significant portion of their monthly expenditure (MPCE) towards Food and Beverages, particularly milk and milk products. Additionally, conveyance appears to be another major spending category.

  • Lower Spending: Rent, footwear, taxes & cesses, entertainment, and dry fruits seem to represent a smaller share of MPCE.

  • Healthcare Concerns: The data also indicates potentially concerning levels of spending on hospitalizations.

2.10 How is the the percentage contribution of MPCE per Eachh Item Group in Urban. ?

Inferences

  • While urban and rural populations allocate MPCE to the same item groups, the spending levels for these groups vary slightly, with urban residents spending somewhat more.

2.11 How does the Pareto Principle apply to MPCE in Rural areas? What percentage of Item Groups contribute to 80% of total spending?.

Inferences

  • In rural areas In 80%, around 50% of the average monthly expenditure is allocated towards essential goods like beverages, milk and milk products, conveyance, cereals, durable goods, fuel and light, vegetables, and clothing and bedding.

  • The remaining 30% of the monthly expenditure covers a broader range of categories, including toilet articles & other household consumables, consumer services (such as repairs or hairdressing), egg, fish & meat, tobacco products, and medical expenses (both hospitalization and non-hospitalization).

2.12 How does the Pareto Principle apply to MPCE in Urban areas? What percentage of Item Groups contribute to 80% of total spending?.

Inferences

  • In Urban areas In 80%, around 50% of the average monthly expenditure is allocated towards essential goods like beverages, milk and milk products, conveyance, durable goods, fuel and light, and consumer services.

  • The remaining 30% of the monthly expenditure covers a broader range of categories, education, toilet articles, clothing & bedding, cereal, vegetables and medical.

In a conclusion

  • Households in both rural and urban areas dedicate roughly half of their monthly spending to essential goods like food, transportation, housing, utilities, and some basic services. The remaining portion is allocated to a wider range of categories, with some variation between the two areas. Rural areas spend more on necessities like clothing and fuel, while urban areas prioritize education and potentially more diverse services.

3.Unsupervised Analysis.

Introduction to Unsupervised Learning Algorithms.

Unsupervised learning is about finding insights & patterns hidden in the data, Unlike regression or classification, we don’t care about splitting our data into train / test sets and making predictions, we just care about understanding the relationships in our data.

1.K-Means:

K-Means Clustering is a popular algorithm which assigns each observation in a data set to a specific cluster, where “K” represents the number of clusters

Here’s how it works:

  1. Select “K” arbitrary locations in a scatter plot as cluster centers (or centroids), and assign each observation to a cluster based on the closest centroid.
  2. Recalculate and relocate each centroid to the mean of the observations assigned to it, then reassign each observation to its new closest centroid
  3. Repeat the process until observations no longer change clusters.

2.Hierarchical Clustering:

Hierarchical Clustering is a clustering technique that creates clusters by grouping similar data points together

3.DBSCAN:

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering technique that identifies clusters based on the density of data points

Here’s how it works:

  1. Select a radius (eps) and a minimum number of points (min_samples)
  2. Within a scatter plot, label each point as one of the following:
    • Core point – has the minimum number of points within its radius (in a dense region)
    • Border point – does not have the minimum number of points within its radius, but has at least one core point within its radius (on the outskirts of clusters)
    • Noise point (outlier) – does not have a core point within its radius (isolated points)

Rural - No of house holds and Average MPCE

  • I have choses K as 2 to do carry out K-means clustering on the Rural data of no of house holds and average MPCE as below.
KMeans(n_clusters=2, random_state=2024)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Inferences:

  • From the above scatterplot it is seen that the house holds between 1-50K and their AVG MPCE falls between 1500 and 8000 are clustered into one.
  • The remaining larger no of housing i.e more than 80K are fit into 2nd cluster.
no_of_house_holds avg_mpce_rural
0 143205.00 3974.70
1 19198.08 5163.42

Inferences:

  • Cluster 2 has a higher average MPCE (5163) compared to Cluster 1 (3974). This suggests a difference of 1189 in average spending.

  • While the average MPCE in Cluster 1 is 3974, Cluster 2 has a slightly higher average of 5163. The difference between the two clusters is around 1189. More neutral phrasing:

  • The average MPCE in Cluster 1 is 3974, while in Cluster 2 it is 5163.

To gain insights from the data, I will identify the optimal number of clusters (K) by minimizing the within-cluster sum of squares (WCSS). WCSS measures how spread out the data points are within each cluster, which can be done through Inertia Metric.

  • Based on the inertia plot, using 5 clusters seems effective for revealing patterns in the rural data.
KMeans(n_clusters=5, random_state=2024)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

  • The silhouette plot helps visualize how well data points are clustered in k-means. By analyzing the distribution of silhouette values, we can assess the quality of the clustering and potentially identify areas for improvement.

  • The silhouette plot indicates a positive outcome for the 5-cluster solution. Four clusters have silhouette scores above 0.7, signifying that points within those clusters are well-positioned and have a significant separation from points in other clusters. This suggests good cluster separation and well-defined clusters.

no_of_house_holds avg_mpce_rural
0 172293.67 3645.67
1 48997.86 4301.57
2 307870.00 3277.00
3 4574.94 5573.22
4 97302.14 4193.29

Inferences:

  • The cluster number 3 has the highest average mpce i.e 5573 compare to the remaining 4 clusters.
  • The cluster number 2 has lowest average mpce i.e 3277 within the 5 clusters.

Let’s explore the data patterns within each of the five clusters by examining the rural states belonging to those clusters.

Rural-Cluster#0:

state no_of_house_holds avg_mpce_rural house_hold_group cluster state_short_code
3 Bihar 198464 3454 D:>=88000 0 BH
14 Maharashtra 150360 4076 D:>=88000 0 MH
28 West Bengal 168057 3407 D:>=88000 0 WB

Rural households from Bihar, Maharashtra, and West Bengal appear to be grouped together in one cluster. This cluster has a larger number of households and a lower average MPCE of 3645, suggesting a potential association with a lower standard of living.

Rural-Cluster#1:

state no_of_house_holds avg_mpce_rural house_hold_group cluster state_short_code
2 Assam 63174 3546 C:26000-87999 1 AS
4 Chhattisgarh 47120 2575 C:26000-87999 1 CT
8 Haryana 33394 4912 C:26000-87999 1 HR
10 Jharkhand 55031 2796 C:26000-87999 1 JH
12 Kerala 46172 5960 C:26000-87999 1 KL
20 Punjab 40039 5363 C:26000-87999 1 PB
24 Telangana 58055 4959 C:26000-87999 1 TS

Rural households from Assam, Chattisgarh, Haryana, Jharkhand,Kerala, punjab and Telangana appear to be grouped together in one cluster. This cluster has moderate average MPCE of 4301, suggesting a potential association with a better standard of living.

Rural-Cluster#2:

state no_of_house_holds avg_mpce_rural house_hold_group cluster state_short_code
27 Uttar Pradesh 307870 3277 D:>=88000 2 UP

Rural households from Uttar Pradesh only to be grouped together in one cluster. This cluster has the least average MPCE of 3277, suggesting a potential association with a poor standard of living.

Rural-Cluster#3:

state no_of_house_holds avg_mpce_rural house_hold_group cluster state_short_code
1 Arunachal Pradesh 1953 5300 B:1500-25999 3 AR
5 Delhi 510 6595 A:1-1499 3 DL
6 Goa 1569 7388 B:1500-25999 3 GA
9 Himachal Pradesh 17260 5573 B:1500-25999 3 HP
15 Manipur 4440 4370 B:1500-25999 3 MN
16 Meghalaya 5610 3530 B:1500-25999 3 ME
17 Mizoram 1229 5243 A:1-1499 3 MZ
18 Nagaland 2341 4457 B:1500-25999 3 NL
22 Sikkim 1189 7787 A:1-1499 3 SK
25 Tripura 7032 5301 B:1500-25999 3 TR
26 Uttarakhand 18136 4721 B:1500-25999 3 UK
29 Andaman & N Islands 625 7332 A:1-1499 3 AN
30 Chandigarh 82 7467 A:1-1499 3 CH
31 Dadra & Nagar Haveli and Daman & Diu 716 4229 A:1-1499 3 DD
32 Jammu & Kashmir 18123 4357 B:1500-25999 3 JK
33 Ladakh 371 4062 A:1-1499 3 LA* (unofficial)
34 Lakshadweep 26 5979 A:1-1499 3 LD
35 Puducherry 1137 6627 A:1-1499 3 PY

Rural households from the smaller states to be grouped together in one cluster. This cluster has the the best average MPCE of 5573, suggesting a potential association with a standard of living in small states.

Rural-Cluster#4:

state no_of_house_holds avg_mpce_rural house_hold_group cluster state_short_code
0 Andhra Pradesh 95813 4996 D:>=88000 4 AP
7 Gujarat 73816 3820 C:26000-87999 4 GJ
11 Karnataka 89231 4578 D:>=88000 4 KA
13 Madhya Pradesh 114334 3158 D:>=88000 4 MP
19 Odisha 87449 2996 C:26000-87999 4 OR
21 Rajasthan 110176 4348 D:>=88000 4 RJ
23 Tamil Nadu 110296 5457 D:>=88000 4 TN

Rural households from Andhrpradhesh, gujaraj, karnataka, MP, Odisha, Rajasthan and tamilnadu appear to be grouped together in one cluster. This cluster has second better average MPCE of 4193, suggesting a potential association with a better standard of living.

Urban - No of house holds and Average MPCE.

I will carry out the same clustering analysis on Urban data as follows.

  • Based on the inertia plot, using 5 clusters seems effective for revealing patterns in the urban data as well.
KMeans(n_clusters=5, random_state=42)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

  • The silhouette plot indicates a positive outcome for the 5-cluster solution. Four clusters have silhouette scores above 0.6, signifying that points within those clusters are well-positioned and have a significant separation from points in other clusters. This suggests good cluster separation and well-defined clusters of Urban data as well.

no_of_house_holds avg_mpce_urban
0 41933.83 6910.17
1 1868.18 7831.18
2 110502.00 7212.50
3 68454.25 6235.25
4 16692.14 5754.29

Inferences:

  • The cluster number 1 has the highest average mpce i.e 7831 compare to the remaining 4 clusters.
  • The cluster number 4 has lowest average mpce i.e 5744 within the 5 clusters.

Let’s explore the data patterns within each of the five clusters by examining the urban states belonging to those clusters.

Cluster#0:

state no_of_house_holds avg_mpce_urban house_hold_group cluster state_short_code
0 Andhra Pradesh 45443 6877 C:26000-87999 0 AP
5 Delhi 30965 8250 C:26000-87999 0 DL
12 Kerala 42846 7102 C:26000-87999 0 KL
13 Madhya Pradesh 41241 5011 C:26000-87999 0 MP
21 Rajasthan 39106 5970 C:26000-87999 0 RJ
24 Telangana 52002 8251 C:26000-87999 0 TS

Urban households from AP, Delhi, Kerala,MP,Rajasthan and Telangana appear to be grouped together in one cluster. This cluster has a moderate average MPCE of 6910, suggesting a potential association with a better standard of living.

Cluster#1:

state no_of_house_holds avg_mpce_urban house_hold_group cluster state_short_code
1 Arunachal Pradesh 602 8649 A:1-1499 1 AR
6 Goa 2270 8761 B:1500-25999 1 GA
9 Himachal Pradesh 2601 8083 B:1500-25999 1 HP
15 Manipur 1767 4902 B:1500-25999 1 MN
16 Meghalaya 1112 6450 A:1-1499 1 ME
17 Mizoram 1033 7664 A:1-1499 1 MZ
18 Nagaland 979 7159 A:1-1499 1 NL
22 Sikkim 608 12125 A:1-1499 1 SK
25 Tripura 2083 7473 B:1500-25999 1 TR
26 Uttarakhand 5353 7034 B:1500-25999 1 UK
29 Andaman & N Islands 498 10268 A:1-1499 1 AN
30 Chandigarh 2521 12577 B:1500-25999 1 CH
31 Dadra & Nagar Haveli and Daman & Diu 1389 6306 A:1-1499 1 DD
32 Jammu & Kashmir 6347 6200 B:1500-25999 1 JK
33 Ladakh 79 6227 A:1-1499 1 LA* (unofficial)
34 Lakshadweep 80 5511 A:1-1499 1 LD
35 Puducherry 2437 7741 B:1500-25999 1 PY

Urban households from Small states appear to be grouped together in one cluster. This cluster has a moderate average MPCE of 7831, suggesting a potential association with the best standard of living.

Cluster#2:

state no_of_house_holds avg_mpce_urban house_hold_group cluster state_short_code
14 Maharashtra 119824 6683 D:>=88000 2 MH
23 Tamil Nadu 101180 7742 D:>=88000 2 TN

Urban households from Maharasthra and Tamilnadu to be grouped together in one cluster. This cluster has a moderate average MPCE of 7212, suggesting a potential association with a better standard of living.

Cluster#3:

state no_of_house_holds avg_mpce_urban house_hold_group cluster state_short_code
7 Gujarat 58955 6630 C:26000-87999 3 GJ
11 Karnataka 61965 7781 C:26000-87999 3 KA
27 Uttar Pradesh 85072 5104 C:26000-87999 3 UP
28 West Bengal 67825 5426 C:26000-87999 3 WB

Urban households from Gujarat,Karnataka,UP and WB to be grouped together in one cluster. This cluster has a moderate average MPCE of 6235, suggesting a potential association with a nominal standard of living.

Cluster#4:

state no_of_house_holds avg_mpce_urban house_hold_group cluster state_short_code
2 Assam 9279 6210 B:1500-25999 4 AS
3 Bihar 18955 4819 B:1500-25999 4 BH
4 Chhattisgarh 12298 4557 B:1500-25999 4 CT
8 Haryana 22330 7948 B:1500-25999 4 HR
10 Jharkhand 14021 4946 B:1500-25999 4 JH
19 Odisha 15796 5223 B:1500-25999 4 OR
20 Punjab 24166 6577 B:1500-25999 4 PB

Urban households from Assam,Bihar Chattisgarh,Haryana,Jharkhand,Odisha and Punjab be grouped together in one cluster. This cluster has a lease average MPCE of 5754, suggesting a potential association with a poor standard of living.

Item Group: Percentage Cotribution of MPCE to Each Item.

Glance at item group data

item_group rural_india_in_202223 urban_india_in_202223
0 cereal 6.90 4.49
1 cereal substitutes 0.02 0.02
2 gram 0.24 0.18
3 pulses and pulse products* 1.73 1.20
4 sugar & salt 0.92 0.60

This item-group data, segmented by rural and urban locations, will be analyzed using various clustering techniques to identify potential patterns.

KMeans.

  • Based on the inertia plot, using 5 clusters seems effective for revealing patterns in the item-group data.
KMeans(n_clusters=5, random_state=2024)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

  • The silhouette plot indicates a positive outcome for the 5-cluster solution. Four clusters have silhouette scores above 0.5, signifying that points within those clusters are well-positioned and have a significant separation from points in other clusters.

rural_india_in_202223 urban_india_in_202223
0 5.17 4.79
1 0.76 0.73
2 7.93 8.33
3 2.99 2.25
4 0.76 6.49

Inferences:

  • The cluster number 2 has the highest average percentage of contributions from MPCE to the item-groups i.e 7.93 in rural and 8.33 in urban compare to the remaining 4 clusters.
  • The cluster number 1 and 2 have least average percentage of contributions i.e 0.76 in rural ad 0.73 in urban within the 5 clusters.

Cluster#0

item_group rural_india_in_202223 urban_india_in_202223 cluster
0 cereal 6.90 4.49 0
6 vegetables 5.26 3.76 0
9 egg, fish & meat 4.80 3.54 0
14 fuel and light 6.51 6.20 0
15 toilet articles & other household consumables 5.01 4.93 0
16 education 3.23 5.73 0
18 medical (non- hospitalization) 4.66 3.96 0
20 consumer services excluding conveyance 4.96 5.86 0
24 clothing & bedding 5.18 4.62 0

Cluster 0 appears to group item categories (cereals, vegetables, egg, fish & meat, fuel & light, toilets, education, non-hospitalization medication, consumer services, and clothing & bedding) where the average MPCE contributions from rural and urban areas are similar.

Cluster#1

item_group rural_india_in_202223 urban_india_in_202223 cluster
1 cereal substitutes 0.02 0.02 1
2 gram 0.24 0.18 1
3 pulses and pulse products* 1.73 1.20 1
4 sugar & salt 0.92 0.60 1
8 fruits (dry) 1.15 1.29 1
21 entertainment 1.06 1.57 1
23 other taxes & cesses 0.12 0.24 1
25 footwear 0.85 0.76 1

Cluster 1 appears to group item categories (cereals substistues, gram, pulse product, sugar &salt, fruits, enteraiment, other taxes and footwear) where the average MPCE contributions from rural and urban areas are similar.

Cluster#2

item_group rural_india_in_202223 urban_india_in_202223 cluster
5 milk and milk products 8.14 7.15 2
12 beverages, processed food#, etc. 9.41 10.53 2
19 conveyance 7.38 8.51 2
26 durable goods 6.79 7.13 2

Cluster 2 appears to group item categories (milk product, beverages, processed foods, conveynace and durable goods) where the average MPCE contributions from urban areas is little higher than rural

Cluster#3

item_group rural_india_in_202223 urban_india_in_202223 cluster
7 fruits (fresh) 2.48 2.48 3
10 edible oil 3.52 2.35 3
11 spices 2.92 2.11 3
13 pan, tobacco & intoxicants 3.70 2.41 3
17 medical (hospitalization) 2.31 1.89 3

Cluster 3 appears to group item categories (fresh fruits, edibile oil, spices, tobaco and medical hospitalization) where the average MPCE contributions from rural and urban areas are similar.

Cluster#4

item_group rural_india_in_202223 urban_india_in_202223 cluster
22 rent 0.76 6.49 4

Interestingly, Cluster 4 consists solely of the “rent” category. The MPCE contribution for rent appears to be higher in urban areas compared to rural areas within this cluster.

I will now use hierarchical and DBSCAN clustering to see if they reveal similar patterns.

Agglomarative.

AgglomerativeClustering(n_clusters=5)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Cluster#0

item_group rural_india_in_202223 urban_india_in_202223 cluster
5 milk and milk products 8.14 7.15 0
12 beverages, processed food#, etc. 9.41 10.53 0
14 fuel and light 6.51 6.20 0
19 conveyance 7.38 8.51 0
26 durable goods 6.79 7.13 0

Cluster#1

item_group rural_india_in_202223 urban_india_in_202223 cluster
1 cereal substitutes 0.02 0.02 1
2 gram 0.24 0.18 1
3 pulses and pulse products* 1.73 1.20 1
4 sugar & salt 0.92 0.60 1
8 fruits (dry) 1.15 1.29 1
21 entertainment 1.06 1.57 1
23 other taxes & cesses 0.12 0.24 1
25 footwear 0.85 0.76 1

Cluster#2

item_group rural_india_in_202223 urban_india_in_202223 cluster
0 cereal 6.90 4.49 2
6 vegetables 5.26 3.76 2
9 egg, fish & meat 4.80 3.54 2
15 toilet articles & other household consumables 5.01 4.93 2
16 education 3.23 5.73 2
18 medical (non- hospitalization) 4.66 3.96 2
20 consumer services excluding conveyance 4.96 5.86 2
24 clothing & bedding 5.18 4.62 2

Cluster#3

item_group rural_india_in_202223 urban_india_in_202223 cluster
7 fruits (fresh) 2.48 2.48 3
10 edible oil 3.52 2.35 3
11 spices 2.92 2.11 3
13 pan, tobacco & intoxicants 3.70 2.41 3
17 medical (hospitalization) 2.31 1.89 3

Cluster#4

item_group rural_india_in_202223 urban_india_in_202223 cluster
22 rent 0.76 6.49 4

DBSCAN

Eps Min Samples Number of Clusters Number of Noise Points Silhouette Score
141 1.8 7 2 7 0.425032
67 0.9 5 2 14 0.397568
64 0.9 2 3 9 0.390227
116 1.5 6 2 8 0.386525
112 1.5 2 3 5 0.372150
DBSCAN(eps=1.8, min_samples=7)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Cluster#0

item_group rural_india_in_202223 urban_india_in_202223 cluster
1 cereal substitutes 0.02 0.02 0
2 gram 0.24 0.18 0
3 pulses and pulse products* 1.73 1.20 0
4 sugar & salt 0.92 0.60 0
7 fruits (fresh) 2.48 2.48 0
8 fruits (dry) 1.15 1.29 0
10 edible oil 3.52 2.35 0
11 spices 2.92 2.11 0
13 pan, tobacco & intoxicants 3.70 2.41 0
17 medical (hospitalization) 2.31 1.89 0
21 entertainment 1.06 1.57 0
23 other taxes & cesses 0.12 0.24 0
25 footwear 0.85 0.76 0

Cluster#1

item_group rural_india_in_202223 urban_india_in_202223 cluster
0 cereal 6.90 4.49 1
6 vegetables 5.26 3.76 1
9 egg, fish & meat 4.80 3.54 1
15 toilet articles & other household consumables 5.01 4.93 1
18 medical (non- hospitalization) 4.66 3.96 1
20 consumer services excluding conveyance 4.96 5.86 1
24 clothing & bedding 5.18 4.62 1

Cluster#Noise Points

item_group rural_india_in_202223 urban_india_in_202223 cluster
5 milk and milk products 8.14 7.15 -1
12 beverages, processed food#, etc. 9.41 10.53 -1
14 fuel and light 6.51 6.20 -1
16 education 3.23 5.73 -1
19 conveyance 7.38 8.51 -1
22 rent 0.76 6.49 -1
26 durable goods 6.79 7.13 -1

Comparing models.

model no_of_clusters slhhoutte_score
Kmeans 5 0.48
agglomorative 5 0.48
DBSCAN 3 0.42

Model Summary:

  • K-means and Agglomerative Clustering have very similar results. They both found 5 clusters with an identical silhouette score of 0.48.

  • Silhouette score is a measure of how well data points are separated within their assigned clusters. A score closer to 1 indicates better separation. While 0.48 isn’t a perfect score, it suggests an acceptable cluster separation.

  • DBSCAN identified a different cluster structure (3 clusters) with a lower silhouette score (0.42). This could indicate that DBSCAN might not be ideal for this particular dataset.

Conclusion

This analysis of MPCE contributions across different item groups identified several key patterns:

  • Cluster 2: Stands out with the highest overall contribution percentage (around 8%) in both rural and urban areas.

  • Clusters 0, 1, and 3: Show similar contribution patterns across rural and urban areas, suggesting these item groups don’t have significant spending differences between locations.

Cluster 0: Includes cereals, vegetables, various food items, fuel & light, and basic necessities.

Cluster 1: Focuses on staples like cereals and pulses, along with sugar, salt, fruits, and entertainment.

Cluster 3: Groups fresh fruits, edible oils, spices, medical hospitalization, and tobacco.

Cluster 4: Uniquely consists of “rent,” with a noticeably higher contribution from MPCE in urban areas compared to rural areas.

Question-2:

A large Insurance company is facing a serious problem of Low revenue, high insurance payout, customer dissatisfaction/ discontinuance as well as stiff competition in the market. You were recently recruited to understand the problem in depth, analyze the already available data to help the management come up with improvement actions which can be implemented and also develop suitable predictive models.

The past data on 500 customers available with the insurance company is given in the excel file (Insurance Data-Supervised Learning). Answer the following questions using the data given in the excel file.

A. Perform basic descriptive analytics and visualization on the Insurance Dataset and give your inferences for the management to initiate improvement actions.

B. Customer Lifetime Value (CLTV) is a very important indicator for any insurance company. It is the amount the company could profit from the customer from the customer is taking the services of the insurance company.

C. Using CLTV as the target variable perform exploratory analytics to identify the most important features contributing significantly to the variation observed in CLTV.

D. Build 3 most appropriate machine learning algorithms which can be best tried/attempted to build a predictive model for future customers.

E. The insurance company management is seriously concerned about the failure to retain customers (discontinuance) and especially the high CLTV customers. Perform exploratory analysis to identify the important features leading to the discontinuance.

F. Build 3 most appropriate machine learning algorithms(classifiers) which can be used to predict whether a customer will discontinue or not.Suggest an optimal model (s) for predicting the CLF and Discontinuance.

Data Preparation

Data Importing and Glance at Data

region location_code gender marital_status employment_status education income vehicle_class vehicle_size monthly_insurance_premium_ ... number_of_open_complaints number_of_policies policy_type policy_level renew_offer_type sales_channel coverage total_claim_amount customer_lifetime_value discontinued
0 South Suburban M Married Retired College 26806 Four-Door Car Medsize 63 ... 0 2 Corporate Auto Corporate L3 Offer1 Agent Basic 302.400000 7019 Yes
1 East Rural M Married Employed Bachelor 35322 Luxury Car Small 244 ... 0 5 Personal Auto Personal L3 Offer1 Agent Extended 76.826503 19185 No
2 South Urban F Married Employed College 46855 SUV Small 119 ... 0 9 Personal Auto Personal L3 Offer3 Branch Basic 472.391125 9501 No
3 South Suburban M Married Retired High School or Below 19683 Sports Car Medsize 117 ... 1 9 Personal Auto Personal L2 Offer1 Web Basic 561.600000 8383 Yes
4 East Suburban F Divorced Disabled Bachelor 19864 Four-Door Car Medsize 63 ... 0 1 Corporate Auto Corporate L3 Offer1 Web Basic 352.394515 2359 Yes

5 rows × 22 columns

Data Type Checkings and Glance at Data.

region                            object
location_code                     object
gender                            object
marital_status                    object
employment_status                 object
education                         object
income                             int64
vehicle_class                     object
vehicle_size                      object
monthly_insurance_premium_         int64
months_since_last_claim            int64
months_since_policy_inception      int64
number_of_open_complaints          int64
number_of_policies                 int64
policy_type                       object
policy_level                      object
renew_offer_type                  object
sales_channel                     object
coverage                          object
total_claim_amount               float64
customer_lifetime_value            int64
discontinued                      object
dtype: object

Checking for Missing Values.

Inference

No Nulls and Empty values are present in the given dataset features and it will be easier to carry out the statistical analysis/modeling. not required to eliminate any of observations.

2. (A and B) Perform basic descriptive analytics and visualization on the Insurance Dataset and give your inferences for the management to initiate improvement actions.

Understanding Categorical and Continous Features.

There are 14 categorical and 8 continous variables are present in the insurance dataset.

Categoical features:

region location_code gender marital_status employment_status education vehicle_class vehicle_size policy_type policy_level renew_offer_type sales_channel coverage discontinued
0 South Suburban M Married Retired College Four-Door Car Medsize Corporate Auto Corporate L3 Offer1 Agent Basic Yes
1 East Rural M Married Employed Bachelor Luxury Car Small Personal Auto Personal L3 Offer1 Agent Extended No
2 South Urban F Married Employed College SUV Small Personal Auto Personal L3 Offer3 Branch Basic No
3 South Suburban M Married Retired High School or Below Sports Car Medsize Personal Auto Personal L2 Offer1 Web Basic Yes
4 East Suburban F Divorced Disabled Bachelor Four-Door Car Medsize Corporate Auto Corporate L3 Offer1 Web Basic Yes

Continous features:

income monthly_insurance_premium_ months_since_last_claim months_since_policy_inception number_of_open_complaints number_of_policies total_claim_amount customer_lifetime_value
0 26806 63 32 19 0 2 302.400000 7019
1 35322 244 4 33 0 5 76.826503 19185
2 46855 119 13 94 0 9 472.391125 9501
3 19683 117 17 41 1 9 561.600000 8383
4 19864 63 22 96 0 1 352.394515 2359

Exploratory Data Analysis.

EDA-1: what are the policy type ?

Inferences

  • Personal Auto policy type propportion is higher than the corporate and special auto.

EDA-2: Which type of customers are taking the policies?

inferences

Employeed persons are only taking the pocilies. and there is a tiny contributions from unemployeed group, the remaining retired, disables and medical leave are equally takig the policies.

EDA-3: From which region the policies are purchased ?

Inferences

  • South and West

EDA-4: Are the rural customers buying more policies than the suburban and urban ?

inferences

  • No, Sububran peoples are buying the more policies than Rural.

EDA-5: For which type of vehicles customers are taking the policies ?

Infeences

  • Customers are preferring to purchase policies for four door and two-door cars

EDA-6: Through which channel policy purchases are happening ?

Inferences

  • Agents are helping to bring more customers to buy policies

EDA-7: Whare are the policy coverage?

Inferences

  • Customers are taking only basic coverages

EDA-8:How many number of customers are discontinued ?

  • 40% of customers are discontinued.

EDA-9: How are the incomes of customers?

count mean std min 25% 50% 75% max
income 500.0 37700.586 29233.9678 0.0 14099.0 32591.0 62142.0 99845.0

inferences

  1. 50% of customers are falling in the income range of 32K and an average income of customers is 37K, mean is little higher than median as its a right skewed.

EDA-10: How are the claim amounts paid to customers?

count mean std min 25% 50% 75% max
total_claim_amount 500.0 429.300851 268.254306 4.845348 291.860184 381.91703 542.4 1778.4

Inferences - 75% of claim amounts are below the prices:545 - There are claims reached out more than 750-1750

EDA-11: any relation ship between num of policies hold and claim amounts?

Inferences

  1. No sort of relations, holiding 1 policy’s customers are claiming more.

EDA-12: Is there any difference i CLTV of contiued and discontinued customers?.

Inferences

Discontinued customers and continued customers lifetime value are differents.

StatisticalAnalysis-Hypothesis testings.

Hypothesis 1:

Hypothesis

        Ho: The mean of customer_lifetime_value is 8500
        Ha: The mean of customer_lifetime_value is not 8500
P-Value Caluclated is:0.702 which is greater than to 0.05, hence we failed to reject the null hypothesis.

inferences - The mean value of customer lifetime value is about 8500**

Hypothesis 2:

Hypothesis

        Ho: There is no difference between mean of customer_lifetime_value contined and discontinued
        Ha: There is a difference between mean of customer_lifetime_value contined and discontinued
P-Value Caluclated is:0.151 which is greater than to 0.05, hence we failed to reject the null hypothesis.

infereces

We have significant evidence to say that there is no difference between mean of customer_lifetime_value contined and discontinued

Hypothesis 3:

Hypothesis

        Ho: The true proportion of discontied customers is 40
        Ha: The true proportion of discontied customers is not 40
P-Value Caluclated is:1.0 which is less than to 0.05, hence we can reject the null hypothesis.

Inferences

  • The true proportion of discontined customers is 40%.

Hypothesis 4: ChiSuare Associaton test: vehicle_class and Discontinued

Hypothesis

        Ho: There is no association between vechicle class and discontinued
        Ha: There is an association between vechicle class and discontinued

P-Value:0.05088 is greater than 0.05 hence we failed to reject the null hypothesis.

Inferences

  • Customer discontinuation rates appear independent of the vehicle class they choose.

2.(C AND D): Regression Problem.

Using CLTV as the target variable perform exploratory analytics to identify the most important features contributing significantly to the variation observed in CLTV.

Build 3 most appropriate machine learning algorithms which can be best tried/attempted to build a predictive model for future customers

Pre-Modeling Activities: Feature Creation and Encoding and Relationship checks

Field:region has levels:5
Field:location_code has levels:3
Field:gender has levels:2
Field:marital_status has levels:3
Field:employment_status has levels:5
Field:education has levels:5
Field:vehicle_class has levels:6
Field:vehicle_size has levels:3
Field:policy_type has levels:3
Field:policy_level has levels:9
Field:renew_offer_type has levels:4
Field:sales_channel has levels:4
Field:coverage has levels:3
Field:discontinued has levels:2

Our dataset contains 14 categorical features, each with more than one level. To encode these features for machine learning analysis, we will employ One-Hot Encoding.

  • After one-hot encoding the categorical variables, we now have separate features for each category within those variables. These encoded features, along with the original non-categorical features, will be used to create the X_predictor (features used for prediction) and Y_target (variable to be predicted) for our model.
income monthly_insurance_premium_ months_since_last_claim months_since_policy_inception number_of_open_complaints number_of_policies total_claim_amount region_South region_East region_West ... renew_offer_type_Offer1 renew_offer_type_Offer3 renew_offer_type_Offer2 sales_channel_Agent sales_channel_Branch sales_channel_Web coverage_Basic coverage_Extended discontinued_Yes discontinued_No
0 26806 63 32 19 0 2 302.400000 1 0 0 ... 1 0 0 1 0 0 1 0 1 0
1 35322 244 4 33 0 5 76.826503 0 1 0 ... 1 0 0 1 0 0 0 1 0 1
2 46855 119 13 94 0 9 472.391125 1 0 0 ... 0 1 0 0 1 0 1 0 0 1
3 19683 117 17 41 1 9 561.600000 1 0 0 ... 1 0 0 0 0 1 1 0 1 0
4 19864 63 22 96 0 1 352.394515 0 1 0 ... 1 0 0 0 0 1 1 0 1 0

5 rows × 51 columns

0     7019
1    19185
2     9501
3     8383
4     2359
Name: customer_lifetime_value, dtype: int64

Inferences

  • Weak Correlations: The analysis reveals that none of the correlation coefficients between features and the target variable exceed 0.4 in either positive or negative direction. This suggests that no single feature has an extremely strong linear relationship with the target variable.

  • Positive Correlations with Target Variable: Among continuous variables, monthly income premium and a luxury vehicle class exhibit the strongest positive correlations with the target variable.

  • Negative Correlations with Target Variable: Conversely, unemployment status and having a basic coverage policy are negatively correlated with the target variable. This suggests that being unemployed and having basic coverage are associated with lower values for the target variable.

Basic Regression modeling.

Model Summary

Inferences

  1. The R-squared value of 0.29 and Adjusted R-squared of 0.21 suggest that the model explains a portion of the variation in the target variable (Y) that can be predicted by the predictor variables (X).

Regression Model Betas.

Inferences

  • The beta coefficients for half of the predictor variables are negative. In other words, a one-unit increase in these variables is likely to lead to a decrease in the value of the response variable, but the magnitude of that decrease is not specified here.

VIF.

Features having VIF values below 5.

index VIF
51 discontinued_No inf
40 policy_level_Corporate L1 inf
32 policy_type_Corporate Auto inf
33 policy_type_Personal Auto inf
34 policy_level_Corporate L3 inf
36 policy_level_Personal L2 inf
37 policy_level_Personal L1 inf
39 policy_level_Corporate L2 inf
35 policy_level_Personal L3 inf
50 discontinued_Yes inf
25 vehicle_class_Four-Door Car 113.41392
29 vehicle_class_Two-Door Car 78.71848
27 vehicle_class_SUV 39.47769
2 monthly_insurance_premium_ 33.12355
48 coverage_Basic 19.21469
28 vehicle_class_Sports Car 17.00781
49 coverage_Extended 9.07340
23 education_High School or Below 7.43945
21 education_College 6.82422
22 education_Bachelor 6.71496
18 employment_status_Employed 6.09022

Features having VIF values greater than or equal to 5 .

index VIF
7 total_claim_amount 4.46738
20 employment_status_Unemployed 4.28022
42 renew_offer_type_Offer1 4.08476
44 renew_offer_type_Offer2 3.70694
1 income 3.42983
24 education_Master 3.27873
10 region_West 3.07734
8 region_South 3.00389
12 location_code_Suburban 2.78891
13 location_code_Rural 2.60509
26 vehicle_class_Luxury Car 2.58534
9 region_East 2.45961
43 renew_offer_type_Offer3 2.41943
31 vehicle_size_Small 2.41205
41 policy_level_Special L2 2.40396
17 employment_status_Retired 2.37559
30 vehicle_size_Medsize 2.30278
45 sales_channel_Agent 2.25997
11 region_Central 1.92946
46 sales_channel_Branch 1.92263
19 employment_status_Disabled 1.88747
15 marital_status_Married 1.83136
47 sales_channel_Web 1.68305
16 marital_status_Divorced 1.66289
38 policy_level_Special L1 1.49380
14 gender_M 1.15027
3 months_since_last_claim 1.12905
6 number_of_policies 1.10814
4 months_since_policy_inception 1.10311
5 number_of_open_complaints 1.09775

Pre-Modeling Activities: Train/Test validation sets creations and Feature Scaling/Standardization.

Train/Test Validation

In statistical modeling, a crucial step involves assessing how well a model performs on unseen data.

  • This is achieved by splitting the data into two sets: a training set and a testing set.

  • The model is trained on the training set, allowing it to learn the underlying relationships within the data. Subsequently, the model’s performance is evaluated on the completely separate testing set.

  • This train-test split approach is a well-established technique to ensure a model’s generalizability, meaning its ability to perform well on new, unseen data that wasn’t used for training.

  • for this analysis i’m using 80-20 trin/test split approch as follows.

Trainset:(400, 51) and testing:(100, 51)

StandardScaler for Effective Scaling

  • In machine learning algorithms, the scale of our input data can significantly impact the training process and ultimately the model’s performance. This is because many algorithms rely on the distances or magnitudes between data points to make predictions.

  • If features have vastly different scales (e.g., one feature has values ranging from 0 to 10, while another ranges from 1000 to 5000), algorithms like Support Vector Machines (SVM) or k-Nearest Neighbors (k-NN) can be misled. Features with larger scales might dominate the distance calculations, even if they aren’t as predictive as features with smaller scales.

  • To address this issue and ensure all features contribute equally to the model’s learning process, This technique performs standardization, which transforms each feature by subtracting the mean and dividing by the standard deviation. The result is a new set of features where:

        The mean is 0 (centered around the mean)
        The standard deviation is 1 (normalized)
  • By applying standardization, all features are placed on a common scale, ensuring that algorithms treat them with equal importance during training.

  • I applied standard scaling to the data before using machine learning algorithms.

Experiments:Supervised MachineLearning Models.

Linear Regression.

Inferences

  • The R-squared values for the training set (0.25) and testing set (0.07) indicate a significant difference in the model’s performance between the data it was trained on and unseen data. This suggests that the model might be overfitting the training data, leading to a good fit on that specific data but poor generalizability to new data

Decision Trees.

DecisionTreeRegressor(max_depth=2)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Inferences

  • The R-squared values of 0.47 for the training set and 0.62 for the testing set are interesting. While a higher testing R-squared (0.62) might seem positive, it’s important to consider that a good model typically exhibits a similar or slightly lower R-squared on the testing set compared to the training set.

  • In our case, the training R-squared (0.47) suggests we could do improvement in the model’s ability to learn from the training data. Further investigation such as feature selection, addressing potential biases could be beneficial.

  • DecisionTrees Feature Importance gives us only 2 variables are having relative importance as showed in the above plot such as number of policies and monthly insurance premium.

RandomForest.

GridSearchCV(cv=3, estimator=RandomForestRegressor(),
             param_grid={'max_depth': [10, 20, 30],
                         'max_features': [2, 5, 8, 12, 15, 17, 21],
                         'n_estimators': [100, 200]},
             scoring='r2')
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestRegressor(max_depth=10, max_features=21, random_state=2024)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Inferences

  • The training and testing R-squared scores (0.94 and 0.72, respectively) reveal a substantial difference in performance, suggesting the model may be overfitting the training data.

Randomforest Regression suggests the below are the potential predictiors to predict the targe variables.

['number_of_policies',
 'monthly_insurance_premium_',
 'months_since_last_claim',
 'income',
 'total_claim_amount',
 'months_since_policy_inception',
 'number_of_open_complaints',
 'vehicle_class_Sports Car',
 'vehicle_class_SUV',
 'vehicle_class_Luxury Car',
 'policy_level_Personal L1',
 'gender_M',
 'coverage_Basic',
 'vehicle_class_Four-Door Car',
 'marital_status_Married',
 'renew_offer_type_Offer1']

To gain a deeper understanding of the relationships between specific features and the target variable, I plan to create a new dataset containing only the following variables listed above and I will then conduct experiments using various regression models to assess their effectiveness in predicting the target variable based on this subset of data.

Data with subset of predictors.

Trainset:(400, 16) and testing:(100, 16)
  • As we can see in the subset of data we do have only 16 predictors.

Experiment-2: DecisionTrees

DecisionTreeRegressor(max_depth=2)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Inferences

  • The training R-squared (0.47) and testing R-squared (0.62) are similar to our previous results, even after removing features from the dataset. This suggests that decision trees are still capturing important patterns in the data, even without all the original features.

Experiment-2: RandomForest

RandomForestRegressor(max_depth=30, max_features=15, n_estimators=200,
                      random_state=2024)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Inferences - While training and testing R-squared aren’t directly comparable, the current model still seems overfitted. This is because the testing R-squared remains high even after removing irrelevant features from the random forest.

Experiment-2: LinearRegression

Inferences

  • Both the original model (without feature removal) and the current model (with feature removal) achieved similar R-squared values in training (0.23) and testing (0.19).

Experiment-2: Boosting-Regression

XGBRFRegressor(base_score=None, booster=None, callbacks=None,
               colsample_bylevel=0.8, colsample_bynode=0.98,
               colsample_bytree=0.62, device=None, early_stopping_rounds=None,
               enable_categorical=False, eval_metric=None, feature_types=None,
               gamma=1.5, grow_policy=None, importance_type=None,
               interaction_constraints=None, learning_rate=0.8, max_bin=None,
               max_cat_threshold=None, max_cat_to_onehot=None,
               max_delta_step=None, max_depth=6, max_leaves=None,
               min_child_weight=None, missing=nan, monotone_constraints=None,
               multi_strategy=None, n_estimators=2200, n_jobs=None,
               num_parallel_tree=None, objective='reg:squarederror', ...)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Infereces

  • The XGBRegressor achieved a training R-squared of 0.51 and a testing R-squared of 0.45. This performance suggests it might be a better choice compared to the other 3 models.

I will also take a look at other metrics such as MAE, RMSE of these models and see.

Regression Conclusion

The table summarizes the performance of two different machine learning models (XGBoost and Trees) on a regression task. The table shows the R-Squared, Mean Squared Error (MSE), Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE) for both the training and testing sets of each model.

Model Type R-Squared MSE MAE RMSE
xgb traininig 0.50 273333475.39 2697.63 5228.14
xgb testing 0.45 24882119.63 3223.14 4988.19
Trees traininig 0.47644 29179327.39 3001.25 5401.78
Tress testing 0.62 17118030.90 2748.26 4137.39

Here’s a breakdown of the metrics for each model:

XGBoost:

Training: XGBoost achieved a training R-Squared of 0.50, which indicates a moderately good fit on the training data. The MSE is 273,333,475.39, the MAE is 2,697.63, and the RMSE is 5,228.14. These values suggest that the model is able to capture some of the underlying patterns in the training data, but there is still some room for improvement.

Testing: The testing R-Squared is 0.45, which is lower than the training R-Squared. This suggests that the model may be overfitting the training data to some extent. The MSE is 248,821,196.30, the MAE is 3,223.14, and the RMSE is 4,988.19. These values are all slightly lower than the corresponding values for the training set, but the overall trend suggests that the model may not generalize well to unseen data.

Trees:

Training: The Trees model achieved a training R-Squared of 0.476, which is slightly lower than the training R-Squared of XGBoost. The MSE is 291,793,273.39, the MAE is 3,001.25, and the RMSE is 5,401.78. These values suggest that the Trees model may be slightly less accurate than XGBoost on the training data.

Testing: Surprisingly, the Trees model achieved a higher testing R-Squared of 0.62 compared to XGBoost. This could be an indication of overfitting on the training data for XGBoost or underfitting for Trees. The MSE is 171,180,309.00, the MAE is 2,748.26, and the RMSE is 4,137.39. These values are all lower than the corresponding values for both the training set and the XGBoost testing set. This suggests that the Trees model may be generalizing better to unseen data.

2.(E AND F): Classification Problem.

Trainset:(400, 50) and testing:(100, 50)

LogisticRgression

LogisticRegression(random_state=2024)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
--------------------------------------
Logistic-training:ConfusionMatrix and ROC-AUC Curve
--------------------------------------

Accuracy ROC-AUC F1-Score Precision Recall
Logistic-training 0.7675 0.74896 0.69307 0.73427 0.65625
--------------------------------------
LogisticRegression-Testing::ConfusionMatrix and ROC-AUC Curve
--------------------------------------

Accuracy ROC-AUC F1-Score Precision Recall
LogisticRegression-Testing: 0.7 0.65833 0.54545 0.69231 0.45

DecisionTrees-Classification

DecisionTreeClassifier(random_state=42)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Cost Complexity Pruning.

In the graph above, we see that the accuracy for the Testing Dataset hits its maximum value when alpha is about 0.014. After this value for alpha, the accuracy of the Training Dataset drops off and that suggests we should set ccp_alpha=0.014.

Decision tree pruning.

--------------------------------------
DecisionTree-Pruned-Traininig::ConfusionMatrix and ROC-AUC Curve
--------------------------------------

Accuracy ROC-AUC F1-Score Precision Recall
DecisionTree-Pruned-Traininig: 0.67 0.58854 0.30526 0.96667 0.18125
--------------------------------------
DecisionTree-Pruned:Testing:ConfusionMatrix and ROC-AUC Curve
--------------------------------------

Accuracy ROC-AUC F1-Score Precision Recall
DecisionTree-Pruned:Testing 0.66 0.575 0.26087 1.0 0.15

RandomForest-Classification

GridSearchCV(cv=5, estimator=RandomForestClassifier(),
             param_grid={'max_depth': [5, 8, 10], 'max_features': [5, 7],
                         'n_estimators': [100, 200]},
             scoring='accuracy')
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
{'max_depth': 8, 'max_features': 7, 'n_estimators': 100}
RandomForestClassifier(max_depth=10, max_features=7, random_state=2024)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
--------------------------------------
Baggin-Training:ConfusionMatrix and ROC-AUC Curve
--------------------------------------

Accuracy ROC-AUC F1-Score Precision Recall
Baggin-Training 0.9975 0.99688 0.99687 1.0 0.99375
--------------------------------------
Bagging-Testing:ConfusionMatrix and ROC-AUC Curve
--------------------------------------

Accuracy ROC-AUC F1-Score Precision Recall
Bagging-Testing 0.84 0.80417 0.75758 0.96154 0.625

XGB-Classification

100%|██████████| 1/1 [00:18<00:00, 18.16s/trial, best loss: -0.6674148056708936]
{'booster': 1,
 'colsample_bylevel': 0.546007358061281,
 'colsample_bynode': 0.5187931176469971,
 'colsample_bytree': 0.7322586366977841,
 'gamma': 8.110007681126655,
 'learning_rate': 0.021701657654466605,
 'max_depth': 3.0,
 'n_estimators': 300.0,
 'reg_lambda': 9.465235229189696,
 'subsample': 0.6343753989670202}
XGBClassifier(base_score=None, booster='dart', callbacks=None,
              colsample_bylevel=0.5, colsample_bynode=0.5, colsample_bytree=0.7,
              device=None, early_stopping_rounds=None, enable_categorical=False,
              eval_metric=None, feature_types=None, gamma=8.11,
              grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.02, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=3, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=300, n_jobs=None,
              num_parallel_tree=None, random_state=None, ...)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
--------------------------------------
Boosting-Training:ConfusionMatrix and ROC-AUC Curve
--------------------------------------

Accuracy ROC-AUC F1-Score Precision Recall
Boosting-Training 0.655 0.56979 0.25 0.95833 0.14375
--------------------------------------
Boosting-Testing:ConfusionMatrix and ROC-AUC Curve
--------------------------------------

Accuracy ROC-AUC F1-Score Precision Recall
Boosting-Testing 0.64 0.55 0.18182 1.0 0.1

Classification ModelSummary

model training-Accuracy testing-Accuracy
LogisticRegression 0.76 0.72
Tree 0.67 0.66
Bagging 0.99 0.84
Boosting 0.65 0.64

1. Logistic Regression: This model achieved a good training accuracy of 76% but dropped slightly to 72% on testing data. This indicates a reasonable fit on the training data but suggests some room for improvement in generalizing to unseen data.

2.Tree: This model has a lower training accuracy (67%) compared to Logistic Regression. Its testing accuracy is also lower (66%) suggesting it might not be capturing the underlying patterns in the data as effectively.

3.Bagging: This model shows a very high training accuracy (99%). However, the testing accuracy drops significantly to 84%. This is a classic sign of overfitting. The model memorized the training data too well and performs poorly on unseen data.

4.Boosting: This model has the lowest training accuracy (65%) among all. However, its testing accuracy is similar (64%). This suggests the model might not be the most accurate but it generalizes reasonably well to unseen data.

In a conclusion - Logistic Regression seems like a balanced choice with decent performance on both training and testing data.

  • Tree might need further exploration or tuning to improve its accuracy.

  • Bagging suffers from severe overfitting and should not be used without further adjustments.

  • Boosting, despite lower training accuracy, generalizes well and could be a good option depending on the specific needs.

Suggest an optimal model (s) for predicting the CLF and Discontinuance.

  • XGBoost Regressor proved to be the most effective model for predicting the continuous CLF target variable. Additionally, XGBoost Classifier emerged as the best model for classifying customer churn, meaning it can best predict whether a customer will discontinue their service or not.

END